Introduction

This report uses R and exploratory data analysis techniques to examine a dataset on white wine quality. The dataset comes from the paper “Modeling wine preferences by data mining from physicochemical properties”; the full citation is in the References section at the end of this report.

The dataset contains several physicochemical attributes from samples of white wine of the Portuguese “Vinho Verde” and has sensory classifications made by wine experts.

Univariate exploration and plots of the data

Taking a look at the data, this is what we find:

## 'data.frame':    4898 obs. of  12 variables:
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...

The data contains 4898 observations of 12 variables. All of the variables are numeric, with quality stored explicitly as an integer.
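
A listing like the one above is what `str()` prints for a data frame. The sketch below is only an illustration: it assumes the data is read from a semicolon-delimited file, and to stay self-contained it reads three rows (values copied from the listing above, restricted to four columns) from an inline string. The file name in the comment is hypothetical.

```r
# Inline stand-in for the real file; values copied from the first three
# observations in the structure listing above (four columns only).
csv_text <- "alcohol;residual.sugar;pH;quality
8.8;20.7;3;6
9.5;1.6;3.3;6
10.1;6.9;3.26;6"

# With the real data this would be something like
# read.csv("wine_white.csv", sep = ";")   # hypothetical file name
wine_sample <- read.csv(text = csv_text, sep = ";")

str(wine_sample)      # column names, types, and leading values
summary(wine_sample)  # min, quartiles, median, mean, max per column
```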

Eleven of the variables are based on physicochemical tests, and the twelfth (quality) is a sensory rating. Each one is described in its own section below.

A summary of the data shows the spread of each variable:

##     alcohol      residual.sugar     chlorides         sulphates     
##  Min.   : 8.00   Min.   : 0.600   Min.   :0.00900   Min.   :0.2200  
##  1st Qu.: 9.50   1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.:0.4100  
##  Median :10.40   Median : 5.200   Median :0.04300   Median :0.4700  
##  Mean   :10.51   Mean   : 6.391   Mean   :0.04577   Mean   :0.4898  
##  3rd Qu.:11.40   3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.:0.5500  
##  Max.   :14.20   Max.   :65.800   Max.   :0.34600   Max.   :1.0800  
##  free.sulfur.dioxide total.sulfur.dioxide  citric.acid     volatile.acidity
##  Min.   :  2.00      Min.   :  9.0        Min.   :0.0000   Min.   :0.0800  
##  1st Qu.: 23.00      1st Qu.:108.0        1st Qu.:0.2700   1st Qu.:0.2100  
##  Median : 34.00      Median :134.0        Median :0.3200   Median :0.2600  
##  Mean   : 35.31      Mean   :138.4        Mean   :0.3342   Mean   :0.2782  
##  3rd Qu.: 46.00      3rd Qu.:167.0        3rd Qu.:0.3900   3rd Qu.:0.3200  
##  Max.   :289.00      Max.   :440.0        Max.   :1.6600   Max.   :1.1000  
##  fixed.acidity          pH           density          quality     
##  Min.   : 3.800   Min.   :2.720   Min.   :0.9871   Min.   :3.000  
##  1st Qu.: 6.300   1st Qu.:3.090   1st Qu.:0.9917   1st Qu.:5.000  
##  Median : 6.800   Median :3.180   Median :0.9937   Median :6.000  
##  Mean   : 6.855   Mean   :3.188   Mean   :0.9940   Mean   :5.878  
##  3rd Qu.: 7.300   3rd Qu.:3.280   3rd Qu.:0.9961   3rd Qu.:6.000  
##  Max.   :14.200   Max.   :3.820   Max.   :1.0390   Max.   :9.000

Plotting each variable as a boxplot gives a baseline view of its variability:
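
One way to draw a boxplot per variable is to reshape the data to long format and plot each variable in its own panel. The sketch below uses base R's `stack()` on made-up toy data; it is not necessarily how the original figure was produced, and the data frame `toy` is hypothetical.

```r
# Toy data standing in for the wine data frame (values are made up).
set.seed(1)
toy <- data.frame(
  alcohol        = rnorm(50, mean = 10.5, sd = 1.2),
  residual.sugar = rexp(50, rate = 1 / 6),
  pH             = rnorm(50, mean = 3.19, sd = 0.15)
)

# Long format: one row per (variable, value) pair.
long <- stack(toy)   # columns: values, ind

# One boxplot per variable, each with its own y axis, since the
# variables are measured on very different scales.
pdf(NULL)            # null device; drop this line in an interactive session
op <- par(mfrow = c(1, ncol(toy)))
for (v in levels(long$ind)) {
  boxplot(long$values[long$ind == v], main = v)
}
par(op)
invisible(dev.off())
```

Separate panels are used here because a shared axis would flatten the variables with small ranges, such as pH.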


Each variable is now explored individually.

Alcohol

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20

The distribution of alcohol concentration is slightly right skewed. Its highest peak is at 9.5 percent alcohol and the median value is 10.40 percent. The maximum alcohol content among the observations is 14.20 percent by volume.

Residual sugar

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800

Taking a closer look:

The distribution of residual sugar has a median value of 5.2 g/dm^3. It is right skewed with a long right tail, and several observations at the far right appear to be outliers. A second plot with them removed is shown for clarity.
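
The report does not state which cutoff was used to drop the far-right observations for the second plot. As a minimal sketch of one common choice, the code below keeps only values at or below the 99th percentile. The 99% threshold and the toy data are assumptions, not the author's method.

```r
# Toy right-skewed sample with two extreme values appended (made-up data;
# 65.8 mirrors the maximum reported in the summary above).
set.seed(42)
sugar <- c(rexp(200, rate = 1 / 6), 45, 65.8)

# Hypothetical rule: keep observations at or below the 99th percentile.
cutoff  <- quantile(sugar, probs = 0.99)
trimmed <- sugar[sugar <= cutoff]

length(sugar) - length(trimmed)  # number of observations dropped
range(trimmed)                   # range left for the second plot
```

Trimming only affects what is plotted; the summary statistics above are computed on the full data.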

Chlorides

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600

Taking a closer look:

The distribution of chlorides in the wine samples has a median value of 0.043 g/dm^3. It looks like there are outliers to the right along its tail, with its max at 0.346 g/dm^3. A second plot with them removed is shown.

Sulphates

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4100  0.4700  0.4898  0.5500  1.0800

And here is a closer look:

The distribution of sulphates is slightly right skewed. The median value of the sulphates is 0.470 and most of the wines have a concentration between 0.410 and 0.550.

Free sulfur dioxide

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   23.00   34.00   35.31   46.00  289.00

A closer look at it:

The distribution of free sulfur dioxide is right skewed, with a maximum of 289 mg/dm^3. There appear to be some outliers, as there are few observations between 100 and 289. The median value is 34 mg/dm^3.

Total sulfur dioxide

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   108.0   134.0   138.4   167.0   440.0

A zoomed in look:

The distribution of total sulfur dioxide is right skewed with a median value of 134 mg/dm^3. There appear to be some outliers, as there are few observations between roughly 260 and 440.

Citric acid

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2700  0.3200  0.3342  0.3900  1.6600

Looking closer at it:

The median citric acid concentration of the wines tested is 0.320 g/dm^3; this acid seems to be found only in small concentrations in wine. There appear to be some outliers above 1 g/dm^3.

Volatile acidity

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2100  0.2600  0.2782  0.3200  1.1000

Looking at it zoomed in:

The median value is 0.260 g/dm^3. Most of the observations fall in the range 0.210 to 0.320, and the outliers are at the higher end, roughly above 0.9 g/dm^3.

Fixed Acidity

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.300   6.800   6.855   7.300  14.200

Getting a closer look:

The median fixed acidity for the white wines in the dataset is 6.80 g/dm^3. Most of the wines tested have an acidity between 6.30 and 7.30. The distribution of fixed acidity is slightly right skewed and there are some outliers in the higher range, roughly above 10.5 g/dm^3. The maximum is 14.20 g/dm^3.

pH

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820

Wines typically have a low pH level. Acids are produced through the fermentation process. The median value is 3.180, and most wines have a pH between 3.090 and 3.280.

Density

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390

A zoomed in view:

The density of the observations varies only a little, with most of the values falling between 0.9917 and 0.9961. This makes sense, as wine has a density close to that of water. The distribution's median value is 0.9937 g/cm^3.

Quality

##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5

The distribution of wine quality appears to be roughly normal, with most wines at an average quality rating of 5 or 6. No wine has a quality rating lower than 3 or higher than 9.
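
To put a number on "most wines", the counts in the table above can be turned into proportions. The sketch below re-enters the counts by hand; with the data frame loaded, the same counts would come from `table(wine$quality)`.

```r
# Counts copied from the table above.
quality_counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
                    "7" = 880, "8" = 175, "9" = 5)

total <- sum(quality_counts)               # 4898 observations
round(quality_counts / total, 3)           # share of each rating

# Share of wines rated 5 or 6
share_mid <- sum(quality_counts[c("5", "6")]) / total
share_mid                                  # about 0.746
```

Roughly three quarters of the samples are rated 5 or 6, while the ratings of 3 and 9 together cover only 25 wines.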

Univariate Analysis

What is the structure of your dataset?

The dataset has 12 variables regarding 4898 observations. Each observation corresponds to a white wine sample of the Portuguese “Vinho Verde”. Of the variables, 11 correspond to the results of a physicochemical test and one variable (quality) corresponds to the result of a sensory panel rating by wine experts.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest in the dataset is the quality rating of each sample.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

The physicochemical test results may help support the investigation into the dataset. All of them are related to characteristics which may affect the flavor profile of the wine. They correspond to concentrations of molecules which may have an overall impact on taste, and by extension, the quality rating of the wine. Density is a physical property which will depend on the percentage of alcohol and sugar content, which will also affect taste of the wine.

Some variables may have a stronger correlation with each other. For instance, the pH will depend on the amount of acid concentration, while total sulfur dioxide may have a similar distribution to that of free sulfur dioxide levels.

Did you create any new variables from existing variables in the dataset?

No new variables were created in the dataset for this analysis.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

There were no unusual distributions. There were also no missing values and no need to adjust the data. It was already a tidy dataset. There were some outliers in the data that were noted, and these might have been due to an input error when recording the data.

Bivariate exploration and plots of the data

Alcohol vs. Quality

## [1] "Median of alcohol by quality:"
## wine$quality: 3
## [1] 10.45
## ------------------------------------------------------------ 
## wine$quality: 4
## [1] 10.1
## ------------------------------------------------------------ 
## wine$quality: 5
## [1] 9.5
## ------------------------------------------------------------ 
## wine$quality: 6
## [1] 10.5
## ------------------------------------------------------------ 
## wine$quality: 7
## [1] 11.4
## ------------------------------------------------------------ 
## wine$quality: 8
## [1] 12
## ------------------------------------------------------------ 
## wine$quality: 9
## [1] 12.5

Apart from a dip in median alcohol at the rating of 5, higher alcohol content seems to go with a higher quality rating.
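
The layout of the output above (one block per `wine$quality` level) is what `by()` prints. A minimal sketch of that grouping on made-up toy data follows; `wine_toy` is a hypothetical stand-in for the real data frame.

```r
# Toy stand-in for the wine data frame (made-up values).
wine_toy <- data.frame(
  quality = c(5, 5, 5, 6, 6, 6, 7, 7, 7),
  alcohol = c(9.2, 9.5, 9.8, 10.1, 10.5, 10.9, 11.0, 11.4, 12.1)
)

# One median per quality rating, printed block by block.
by(wine_toy$alcohol, wine_toy$quality, median)

# tapply() returns the same medians as a named array, which is
# easier to reuse in later calculations or plots.
med <- tapply(wine_toy$alcohol, wine_toy$quality, median)
med
```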

## 
##  Pearson's product-moment correlation
## 
## data:  quality_as_int and alcohol
## t = 33.858, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4126015 0.4579941
## sample estimates:
##       cor 
## 0.4355747

The Pearson correlation test shows a moderate positive correlation (r ≈ 0.44) between the alcohol content of a wine sample and the quality rating it receives.
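
The test statistic in the output above follows from the correlation and the degrees of freedom through the standard formula t = r * sqrt(df / (1 - r^2)). The short check below recomputes it from the reported values.

```r
r  <- 0.4355747   # sample correlation reported above
df <- 4896        # degrees of freedom, n - 2 with n = 4898

# Test statistic for H0: true correlation is 0
t_stat <- r * sqrt(df / (1 - r^2))
t_stat            # about 33.86, matching the reported t

# Share of the variance in quality that a straight-line fit on
# alcohol would account for
r^2               # about 0.19
```

With almost 4900 observations even small correlations give tiny p-values, so the size of r is more informative here than the p-value.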

Residual Sugar vs. Quality

## [1] "Median of residual.sugar by quality:"
## wine$quality: 3
## [1] 4.6
## ------------------------------------------------------------ 
## wine$quality: 4
## [1] 2.5
## ------------------------------------------------------------ 
## wine$quality: 5
## [1] 7
## ------------------------------------------------------------ 
## wine$quality: 6
## [1] 5.3
## ------------------------------------------------------------ 
## wine$quality: 7
## [1] 3.65
## ------------------------------------------------------------ 
## wine$quality: 8
## [1] 4.3
## ------------------------------------------------------------ 
## wine$quality: 9
## [1] 2.2

Now taking a closer look by limiting the Y axis:

Residual sugar seems to have a low impact on the quality rating of the wines. Interestingly, residual sugar tends to be lower at the rating of 9 than at the rating of 3, even though it tends to go up for the mid-range ratings.

## 
##  Pearson's product-moment correlation
## 
## data:  quality_as_int and residual.sugar
## t = -6.8603, df = 4896, p-value = 7.724e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.12524103 -0.06976101
## sample estimates:
##         cor 
## -0.09757683

The correlation test agrees: the correlation is negative but weak (r ≈ -0.10).

Chlorides vs. Quality

## [1] "Median of chlorides by quality:"
## wine$quality: 3
## [1] 0.041
## ------------------------------------------------------------ 
## wine$quality: 4
## [1] 0.046
## ------------------------------------------------------------ 
## wine$quality: 5
## [1] 0.047
## ------------------------------------------------------------ 
## wine$quality: 6
## [1] 0.043
## ------------------------------------------------------------ 
## wine$quality: 7
## [1] 0.037
## ------------------------------------------------------------ 
## wine$quality: 8
## [1] 0.036
## ------------------------------------------------------------ 
## wine$quality: 9
## [1] 0.031

Now looking at it a little closer:

There is possibly a slight relation: lower chlorides could mean a higher quality rating. Interestingly, the relation seems to be positive up to the rating of 5 and negative from there on.

## 
##  Pearson's product-moment correlation
## 
## data:  quality_as_int and chlorides
## t = -15.024, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2365501 -0.1830039
## sample estimates:
##        cor 
## -0.2099344

The correlation test shows similar insights into the data, with a slight negative correlation found.

Sulphates vs. Quality

## [1] "Median of sulphates by quality:"
## wine$quality: 3
## [1] 0.44
## ------------------------------------------------------------ 
## wine$quality: 4
## [1] 0.47
## ------------------------------------------------------------ 
## wine$quality: 5
## [1] 0.47
## ------------------------------------------------------------ 
## wine$quality: 6
## [1] 0.48
## ------------------------------------------------------------ 
## wine$quality: 7
## [1] 0.48
## ------------------------------------------------------------ 
## wine$quality: 8
## [1] 0.46
## ------------------------------------------------------------ 
## wine$quality: 9
## [1] 0.46

Now taking a more zoomed in look at the trend.

There seems to be very little relationship between sulphates and quality.

## 
##  Pearson's product-moment correlation
## 
## data:  quality_as_int and sulphates
## t = 3.7613, df = 4896, p-value = 0.000171
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.02571007 0.08156172
## sample estimates:
##        cor 
## 0.05367788

Very little correlation is found.

Free sulfur dioxide vs. Quality

## [1] "Median of free.sulfur.dioxide by quality:"
## wine$quality: 3
## [1] 33.5
## ------------------------------------------------------------ 
## wine$quality: 4
## [1] 18
## ------------------------------------------------------------ 
## wine$quality: 5
## [1] 35
## ------------------------------------------------------------ 
## wine$quality: 6
## [1] 34
## ------------------------------------------------------------ 
## wine$quality: 7
## [1] 33
## ------------------------------------------------------------ 
## wine$quality: 8
## [1] 35
## ------------------------------------------------------------ 
## wine$quality: 9
## [1] 28

There seems to be little relation here.

According to the information that was provided with the dataset, when free SO2 is lower than 50 ppm (~ 50 mg/L), it is undetectable to humans. In the following plot there are very few wines that are above this level which suggests that the variations seen in this plot are not related to an effect of the free SO2, but to the unbalanced distribution of wines across the quality ratings.

So only a little correlation would be expected.

## 
##  Spearman's rank correlation rho
## 
## data:  quality_as_int and free.sulfur.dioxide
## S = 1.912e+10, p-value = 0.09703
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## 0.02371338

And little correlation is found.
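
This is the one test in the section that uses Spearman's rank correlation instead of Pearson's. The report does not say why; rank correlation is a common choice when a variable has a long tail or outliers, as free sulfur dioxide does. A minimal sketch on made-up toy data:

```r
# Toy data with tied ratings (made-up values).
quality  <- c(5, 5, 6, 6, 6, 7, 7, 8)
free_so2 <- c(35, 18, 34, 40, 29, 33, 31, 36)

# exact = FALSE requests the asymptotic p-value, which avoids the warning
# that an exact p-value cannot be computed when ties are present.
res <- cor.test(quality, free_so2, method = "spearman", exact = FALSE)
res$estimate   # rho
res$p.value
```

Spearman's rho is the Pearson correlation of the ranks, so it responds to any monotonic trend and is less affected by extreme values.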

Total sulfur dioxide vs. Quality

## [1] "Median of total.sulfur.dioxide by quality:"
## wine$quality: 3
## [1] 159.5
## ------------------------------------------------------------ 
## wine$quality: 4
## [1] 117
## ------------------------------------------------------------ 
## wine$quality: 5
## [1] 151
## ------------------------------------------------------------ 
## wine$quality: 6
## [1] 132
## ------------------------------------------------------------ 
## wine$quality: 7
## [1] 122
## ------------------------------------------------------------ 
## wine$quality: 8
## [1] 122
## ------------------------------------------------------------ 
## wine$quality: 9
## [1] 119

The pattern is similar to that of free sulfur dioxide.

## 
##  Pearson's product-moment correlation
## 
## data:  quality_as_int and total.sulfur.dioxide
## t = -12.418, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2017563 -0.1474524
## sample estimates:
##        cor 
## -0.1747372

Interestingly, the negative correlation here is somewhat stronger than the correlation found for free sulfur dioxide.

Citric Acid vs. Quality

## [1] "Median of citric.acid by quality:"
## wine$quality: 3
## [1] 0.345
## ------------------------------------------------------------ 
## wine$quality: 4
## [1] 0.29
## ------------------------------------------------------------ 
## wine$quality: 5
## [1] 0.32
## ------------------------------------------------------------ 
## wine$quality: 6
## [1] 0.32
## ------------------------------------------------------------ 
## wine$quality: 7
## [1] 0.31
## ------------------------------------------------------------ 
## wine$quality: 8
## [1] 0.32
## ------------------------------------------------------------ 
## wine$quality: 9
## [1] 0.36

A zoomed in look:

There does not seem to be a relationship here.

## 
##  Pearson's product-moment correlation
## 
## data:  quality_as_int and citric.acid
## t = -0.6444, df = 4896, p-value = 0.5193
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.03720595  0.01880221
## sample estimates:
##          cor 
## -0.009209091

The correlation test confirms the previous observation of the graph.

Volatile Acidity vs. Quality

## [1] "Median of volatile.acidity by quality:"
## wine$quality: 3
## [1] 0.26
## ------------------------------------------------------------ 
## wine$quality: 4
## [1] 0.32
## ------------------------------------------------------------ 
## wine$quality: 5
## [1] 0.28
## ------------------------------------------------------------ 
## wine$quality: 6
## [1] 0.25
## ------------------------------------------------------------ 
## wine$quality: 7
## [1] 0.25
## ------------------------------------------------------------ 
## wine$quality: 8
## [1] 0.26
## ------------------------------------------------------------ 
## wine$quality: 9
## [1] 0.27

And now a zoomed in look:

There seems to be a slight downward trend until the rating of 9, which could be due to the more limited sample size of quality level 9 wines.

## 
##  Pearson's product-moment correlation
## 
## data:  quality_as_int and volatile.acidity
## t = -13.891, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2215214 -0.1676307
## sample estimates:
##       cor 
## -0.194723

A weak negative correlation (r ≈ -0.19) is found.

Fixed Acidity vs. Quality

## [1] "Median of fixed.acidity by quality:"
## wine$quality: 3
## [1] 7.3
## ------------------------------------------------------------ 
## wine$quality: 4
## [1] 6.9
## ------------------------------------------------------------ 
## wine$quality: 5
## [1] 6.8
## ------------------------------------------------------------ 
## wine$quality: 6
## [1] 6.8
## ------------------------------------------------------------ 
## wine$quality: 7
## [1] 6.7
## ------------------------------------------------------------ 
## wine$quality: 8
## [1] 6.8
## ------------------------------------------------------------ 
## wine$quality: 9
## [1] 7.1

A closer look:

There is a slight trend of higher quality ratings at lower fixed acidity. However, there are fewer observations at the quality ratings of 3 and 9 than at the middle ratings, which may make those median values less reliable. Additionally, the acidity values are widely dispersed within each quality rating.

## 
##  Pearson's product-moment correlation
## 
## data:  quality_as_int and fixed.acidity
## t = -8.005, df = 4896, p-value = 1.48e-15
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.14121974 -0.08592991
## sample estimates:
##        cor 
## -0.1136628

Only a little negative correlation is found.

pH vs. Quality

## [1] "Median of pH by quality:"
## wine$quality: 3
## [1] 3.215
## ------------------------------------------------------------ 
## wine$quality: 4
## [1] 3.16
## ------------------------------------------------------------ 
## wine$quality: 5
## [1] 3.16
## ------------------------------------------------------------ 
## wine$quality: 6
## [1] 3.18
## ------------------------------------------------------------ 
## wine$quality: 7
## [1] 3.2
## ------------------------------------------------------------ 
## wine$quality: 8
## [1] 3.23
## ------------------------------------------------------------ 
## wine$quality: 9
## [1] 3.28

And now a closer look:

There seems to be an upward trend here. Since a higher pH means lower acidity, this could mean that less acidic wines tend to receive a higher quality rating. This relationship will be checked later on.

## 
##  Pearson's product-moment correlation
## 
## data:  quality_as_int and pH
## t = 6.9917, df = 4896, p-value = 3.081e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.07162022 0.12707983
## sample estimates:
##        cor 
## 0.09942725

However, the correlation found is weak.

Density vs. Quality

## [1] "Median of density by quality:"
## wine$quality: 3
## [1] 0.994425
## ------------------------------------------------------------ 
## wine$quality: 4
## [1] 0.9941
## ------------------------------------------------------------ 
## wine$quality: 5
## [1] 0.9953
## ------------------------------------------------------------ 
## wine$quality: 6
## [1] 0.99366
## ------------------------------------------------------------ 
## wine$quality: 7
## [1] 0.99176
## ------------------------------------------------------------ 
## wine$quality: 8
## [1] 0.99164
## ------------------------------------------------------------ 
## wine$quality: 9
## [1] 0.9903

And zooming in shows:

Lower density seems to mean a higher quality rating, although the rating of 5 breaks this trend slightly. According to the information provided with the dataset, density depends on the percentage of alcohol and the sugar content of the wine. This relationship will be checked later on, but there could be a relation.

## 
##  Pearson's product-moment correlation
## 
## data:  quality_as_int and density
## t = -22.581, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3322718 -0.2815385
## sample estimates:
##        cor 
## -0.3071233

A moderate negative correlation (r ≈ -0.31) is found.

Alcohol vs. Residual Sugar

And now taking a zoomed look at it:

And its correlation test results:

## 
##  Pearson's product-moment correlation
## 
## data:  residual.sugar and alcohol
## t = -35.321, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4726723 -0.4280267
## sample estimates:
##        cor 
## -0.4506312

A stronger correlation between alcohol content and residual sugar was expected, since the alcohol should come from the fermentation of the sugars. However, at r ≈ -0.45 this is still a moderate negative correlation.

Possibly some of the wines are fortified with extra alcohol added after the fermentation process, or the yeast behaves in such a way that does not allow the data to establish a linear relationship between sugar fermentation and alcohol production. There is also the fact that the data does not mention which types of grapes were used, which may have different sugar contents that could impact this relationship.

Sulphates vs. Total Sulfur Dioxide

And now taking a zoomed look at it:

And its correlation test results:

## 
##  Pearson's product-moment correlation
## 
## data:  sulphates and total.sulfur.dioxide
## t = 9.5019, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1069590 0.1619585
## sample estimates:
##       cor 
## 0.1345624

The sulphate additive does not seem to have a large correlation with the total sulfur dioxide in the wine samples.

Sulphates vs. Free Sulfur Dioxide

And now taking a zoomed look at it:

And its correlation test results:

## 
##  Pearson's product-moment correlation
## 
## data:  sulphates and free.sulfur.dioxide
## t = 4.1508, df = 4896, p-value = 3.369e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.03126264 0.08707928
## sample estimates:
##        cor 
## 0.05921725

The sulphate additive shows almost no correlation with the free sulfur dioxide in the white wine samples.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

The wine quality rating has its strongest relationships with three variables: alcohol content, density, and chlorides. Two others, total sulfur dioxide and volatile acidity, are noteworthy but less impactful. The correlation coefficients below show the strength of each variable's relationship with quality.

##                             [,1]
## alcohol               0.44036918
## residual.sugar       -0.08206979
## chlorides            -0.31448848
## sulphates             0.03331897
## free.sulfur.dioxide   0.02371338
## total.sulfur.dioxide -0.19668029
## citric.acid           0.01833273
## volatile.acidity     -0.19656168
## fixed.acidity        -0.08448545
## pH                    0.10936208
## density              -0.34835102
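
The coefficients in this table differ slightly from the Pearson estimates reported earlier, and the value for free sulfur dioxide equals the Spearman rho from its test. That suggests, though the report does not say so, that the table holds rank correlations. A minimal sketch of building such a one-column table on made-up toy data, with `method = "spearman"` as an assumption:

```r
# Toy stand-in for the wine data frame (made-up values and effects).
set.seed(7)
n <- 100
wine_toy <- data.frame(alcohol   = rnorm(n, 10.5, 1.2),
                       chlorides = rexp(n, rate = 20))
wine_toy$quality <- round(6 + 0.5 * (wine_toy$alcohol - 10.5) -
                          5 * wine_toy$chlorides + rnorm(n, sd = 0.7))

predictors <- setdiff(names(wine_toy), "quality")

# cor() with a data frame and a vector returns a one-column matrix,
# printed with the header [,1] as in the table above.
rho <- cor(wine_toy[, predictors], wine_toy$quality, method = "spearman")
rho

# Order the predictors by absolute strength of association.
rho[order(-abs(rho[, 1])), , drop = FALSE]
```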

The negative correlation for density makes sense, as alcohol has an inverse relationship with density, and alcohol is itself a large contributor to the quality rating. The negative relation for chlorides may reflect saltiness: a wine with a lower chloride concentration tastes less salty, which could explain why higher rated wines have lower chloride concentrations. The relationships with total sulfur dioxide and volatile acidity are also of interest.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

The relationship between alcohol level and density was as expected.

It was of interest to observe the relationship between chlorides and quality.

It was unexpected to not find a stronger relationship between the residual sugar and alcohol concentration, since the alcohol should in theory come from the fermentation of sugars in the wine making process.

What was the strongest relationship you found?

The correlation coefficients show that the variable with the strongest relationship with quality rating is the alcohol concentration.

Multivariate exploration and plots of the data

Correlation Matrix

Here is a correlation matrix of the data:
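
A matrix like this can be computed with `cor()` on the whole data frame. The sketch below uses made-up toy data in which density is built to fall as alcohol rises; it illustrates the call and is not a reproduction of the report's figure.

```r
# Toy stand-in (made-up values); density is constructed to decrease
# with alcohol, and pH is unrelated to both.
set.seed(3)
n <- 200
alcohol <- rnorm(n, 10.5, 1.2)
toy <- data.frame(
  alcohol = alcohol,
  density = 0.994 - 0.002 * (alcohol - 10.5) + rnorm(n, sd = 0.001),
  pH      = rnorm(n, 3.19, 0.15)
)

cm <- round(cor(toy), 2)   # Pearson by default
cm

# Locate the strongest off-diagonal pair.
off <- cm
diag(off) <- 0
which(abs(off) == max(abs(off)), arr.ind = TRUE)
```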

Alcohol, Density and Quality

Of all the variables, alcohol content correlates most strongly with quality, and density should go down as alcohol goes up. So, in general, as density decreases, alcohol increases and the quality rating goes up.

The lowest quality wines have low alcohol and high density. The middle quality wines (rated 5 and 6) are spread throughout the plot area, but more wines of quality level 3 can be seen on the left side of the graph, and more blue points (ratings of 8 or 9) toward the right side.
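
A numeric companion to the colored scatterplot is to group the ratings into a few buckets and compare medians. The three-level grouping and the toy values below are hypothetical, chosen only to show the technique.

```r
# Toy stand-in for the wine data frame (made-up values).
wine_toy <- data.frame(
  quality = c(3, 4, 5, 5, 6, 6, 7, 8, 9),
  alcohol = c(9.0, 9.4, 9.5, 9.8, 10.4, 10.6, 11.4, 12.0, 12.5),
  density = c(0.9980, 0.9965, 0.9958, 0.9950, 0.9940, 0.9936,
              0.9918, 0.9915, 0.9903)
)

# Hypothetical grouping: 3-4 low, 5-6 middle, 7-9 high.
wine_toy$bucket <- cut(wine_toy$quality,
                       breaks = c(2, 4, 6, 9),
                       labels = c("low", "middle", "high"))

# Median alcohol and density per bucket
agg <- aggregate(cbind(alcohol, density) ~ bucket,
                 data = wine_toy, FUN = median)
agg
```

If the pattern described above holds in the real data, median alcohol would rise and median density would fall from the low bucket to the high bucket.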

Alcohol, Chlorides and Quality

The trend does seem to be that as the chloride concentration goes down and the alcohol concentration goes up, the quality increases.

Alcohol, Total Sulfur Dioxide and Quality

Total sulfur dioxide does not seem to have much effect on the quality of the wine.

Alcohol, Volatile Acidity and Quality

Volatile acidity does not seem to have much of an impact, although in general the lower the volatile acidity, the better the quality rating the wine receives.

Chlorides, Volatile Acidity and Quality

When both chlorides and volatile acidity are lower, the wine tends to be given a better quality rating.

Chlorides, Total Sulfur Dioxide and Quality

The higher rated wine samples seem to fall where chlorides are lower and total sulfur dioxide is between 100 and 200 mg/dm^3.

Chlorides, Density and Quality

Lower density and lower chlorides tend to go with a higher quality rating.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

The main relationships explored were those among the variables most strongly correlated with quality.

It has been shown how alcohol, chlorides, density, total sulfur dioxide, and volatile acidity relate to quality. Higher alcohol, lower density, and low chloride concentration will typically give a better wine rating for quality.

A total sulfur dioxide level between 100 and 200 mg/dm^3 also tends to go with a better quality rating.

Final Plots and Summary

Plot One

As chlorides go down, there is an increase in the quality rating of the wines. It is not a very strong relationship, but it is noteworthy nonetheless.

Plot Two

Here is the distribution of alcohol across the different quality ratings. The boxplot shows the quartile boundaries and median values, while the overlaid dots show the actual distribution of the wine samples. Median alcohol declines from the rating of 3 to the rating of 5, but rises steadily from there. The samples are unevenly distributed: there are many more wines with middle quality ratings than with low or high ratings. The line connects the median values and helps to show the increasing trend of alcohol with quality rating.

Plot Three

The analysis suggests that alcohol and chlorides both play a notable role in the quality rating a wine receives. This plot shows the trend: the lower a wine's chloride concentration and the higher its alcohol concentration, the higher its quality rating is likely to be. The inverse trend is also visible: wine samples with a high chloride concentration and a low alcohol concentration tend to receive lower ratings.

Reflection

Working with datasets has the challenge of deciding how to approach the exploration of the data. Because this dataset came with a description file, it already outlined some variables that might lend themselves to exploration. This proved quite useful. For example, the description file said that citric acid could add a freshness element to the taste of wines, while acetic acid would add an almost vinegar-like taste, and this provided context for many of the observations that followed. Another example is the density of wine being close to that of water, but lower as its alcohol content increases. This explained the inverse relationship seen between density and alcohol in the graphs. It also shows how important it is to have some knowledge of the subject matter when beginning an analysis, as some context could be lost on a casual observer. This knowledge helps structure the approach to the analysis and allows better theory formation, so that meaningful insights can be drawn from the data.

Another challenge was figuring out how to communicate with the multivariate plots. Adding a third dimension to a plot makes it harder to visualize mentally. Color helped with this, making it easier to grasp what information each step of the analysis was communicating and adding clarity and depth. The correlation matrix was a welcome addition, as it helped to narrow down which variables to focus on for further exploration. Overall, data needs to be communicated in such a way that the reader can easily understand and follow its story, so putting extra effort into making it legible proved a good use of time. The dataset already being clean and tidy also made it significantly easier to work with.

As a way to expand the analysis, bringing in other different types of wines would be an interesting way to see if the trends are strengthened, or weakened, with this new data. Expanding the dataset with the same type of wine, but simply having more observations would also be interesting, as the dataset used in this analysis was not very large. It would also be of interest if some additional variables could be added to the total dataset, such as type of grape, location grown, and how long before the grape was harvested.

In summary, having found and explored the main relationships in the dataset, a logical next step would be to use these trends to predict the quality ratings of other wines. Data gathered on how well those predictions hold could then be used to refine the prediction process further.
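
As a rough sketch of what such a prediction step could look like, the code below fits a linear model on simulated data and checks it on held-out rows. The simulated effects, the choice of `lm()`, and the 200/100 split are all assumptions for illustration; no model was fitted in this report.

```r
# Simulated stand-in data; the effect sizes are invented for illustration.
set.seed(11)
n <- 300
sim <- data.frame(alcohol   = rnorm(n, 10.5, 1.2),
                  chlorides = rexp(n, rate = 22))
sim$quality <- 5.9 + 0.35 * (sim$alcohol - 10.5) -
  4 * (sim$chlorides - 0.045) + rnorm(n, sd = 0.75)

# Fit on the first 200 rows, predict the remaining 100.
train <- sim[1:200, ]
test  <- sim[201:300, ]
fit   <- lm(quality ~ alcohol + chlorides, data = train)
coef(fit)

pred <- predict(fit, newdata = test)
sqrt(mean((test$quality - pred)^2))   # root mean squared prediction error
```

Because quality is an ordered rating and not a continuous measurement, an ordinal model might suit the real data better; the linear fit is used here only because it is the simplest to read.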

References

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

Available at: [@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016 [Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf [bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib

StackOverflow website for various questions and research. Available at: http://www.Stackoverflow.com

R Bloggers website for various guides to using R. Available at: http://www.r-bloggers.com

R Markdown website by RStudio. Available at: https://rmarkdown.rstudio.com/authoring_pandoc_markdown.html